1 Introduction


In this kernel we are going to perform a statistical text analysis of the Star Wars scripts from the Original Trilogy (Episodes IV, V and VI), using wordclouds to show the most frequent words. The input files used for the analysis are available here. This post is my personal tribute to Star Wars Day, May 4.


2 Loading data


# Load libraries
library(tidyverse)
library(tm)
library(wordcloud)
library(wordcloud2)
library(tidytext)
library(reshape2)

# Read the data
ep4 <- read.table("SW_EpisodeIV.txt")
ep5 <- read.table("SW_EpisodeV.txt")
ep6 <- read.table("SW_EpisodeVI.txt")

3 Functions


The first function performs cleaning and preprocessing steps to a corpus:

# Text transformations
cleanCorpus <- function(corpus){
  
  corpus.tmp <- tm_map(corpus, removePunctuation)
  corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
  corpus.tmp <- tm_map(corpus.tmp, content_transformer(tolower))
  v_stopwords <- c(stopwords("english"), c("thats","weve","hes","theres","ive",
                                           "will","can","cant","dont","youve",
                                           "youre","youll","theyre","whats","didnt","us"))
  corpus.tmp <- tm_map(corpus.tmp, removeWords, v_stopwords)
  corpus.tmp <- tm_map(corpus.tmp, removeNumbers)
  return(corpus.tmp)
  
}

The second function constructs the term-document matrix, which describes the frequency of the terms that occur in a collection of documents. In this matrix, rows correspond to terms and columns to documents.

# Most frequent terms 
frequentTerms <- function(text){
  
  s.cor <- Corpus(VectorSource(text))
  s.cor.cl <- cleanCorpus(s.cor)
  s.tdm <- TermDocumentMatrix(s.cor.cl)
  s.tdm <- removeSparseTerms(s.tdm, 0.99)
  m <- as.matrix(s.tdm)
  word_freqs <- sort(rowSums(m), decreasing=TRUE)
  dm <- data.frame(word=names(word_freqs), freq=word_freqs)
  return(dm)
  
}
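As a quick sanity check of that layout, a toy two-document corpus (an illustrative sketch, not part of the analysis) makes the orientation explicit:

```r
library(tm)

# Two tiny "documents"; stopwords are removed as in cleanCorpus()
toy <- Corpus(VectorSource(c("the force is strong", "use the force")))
toy <- tm_map(toy, removeWords, stopwords("english"))
m <- as.matrix(TermDocumentMatrix(toy))
m  # rows are the terms "force", "strong" and "use"; columns are the two documents
```

Summing over the columns of this matrix (`rowSums`), as `frequentTerms()` does, gives the total frequency of each term across all documents.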

4 Episode IV: A New Hope


# How many dialogues?
length(ep4$dialogue)
## [1] 1010
# How many characters?
length(levels(ep4$character))
## [1] 60
# Top 20 characters with the most dialogues 
top.ep4.chars <- as.data.frame(sort(table(ep4$character), decreasing=TRUE))[1:20,]

# Visualization 
ggplot(data=top.ep4.chars, aes(x=Var1, y=Freq)) +
  geom_bar(stat="identity", fill="#56B4E9", colour="black") +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  labs(x="Character", y="Number of dialogues")

# Wordcloud for Episode IV
wordcloud2(frequentTerms(ep4$dialogue), size=0.5,
           figPath="../Star-Wars-Movie-Scripts/vader.png")


5 Episode V: The Empire Strikes Back


# How many dialogues?
length(ep5$dialogue)
## [1] 839
# How many characters?
length(levels(ep5$character))
## [1] 49
# Top 20 characters with the most dialogues 
top.ep5.chars <- as.data.frame(sort(table(ep5$character), decreasing=TRUE))[1:20,]

# Visualization 
ggplot(data=top.ep5.chars, aes(x=Var1, y=Freq)) +
  geom_bar(stat="identity", fill="#56B4E9", colour="black") +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  labs(x="Character", y="Number of dialogues")

# Wordcloud for Episode V
wordcloud2(frequentTerms(ep5$dialogue), size=0.5,
           figPath="../Star-Wars-Movie-Scripts/yoda.png")


6 Episode VI: Return of the Jedi


# How many dialogues?
length(ep6$dialogue)
## [1] 674
# How many characters?
length(levels(ep6$character))
## [1] 53
# Top 20 characters with the most dialogues 
top.ep6.chars <- as.data.frame(sort(table(ep6$character), decreasing=TRUE))[1:20,]

# Visualization 
ggplot(data=top.ep6.chars, aes(x=Var1, y=Freq)) +
  geom_bar(stat="identity", fill="#56B4E9", colour="black") +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  labs(x="Character", y="Number of dialogues")

# Wordcloud for Episode VI
wordcloud2(frequentTerms(ep6$dialogue), size=0.5,
           figPath="../Star-Wars-Movie-Scripts/r2d2.png")


7 The Original Trilogy


In this section we are going to compute the same statistics, but now considering all three movies of the Original Trilogy (Episodes IV, V and VI) together.

# The Original Trilogy dialogues 
trilogy <- rbind(ep4, ep5, ep6)

# How many dialogues?
length(trilogy$dialogue)
## [1] 2523
# How many characters?
length(levels(trilogy$character))
## [1] 129
# Top 20 characters with the most dialogues 
top.chars <- as.data.frame(sort(table(trilogy$character), decreasing=TRUE))[1:20,]

# Visualization 
ggplot(data=top.chars, aes(x=Var1, y=Freq)) +
  geom_bar(stat="identity", fill="#56B4E9", colour="black") +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  labs(x="Character", y="Number of dialogues")

C-3PO has more dialogues than Leia and Darth Vader? Ugh…

# Wordcloud for The Original Trilogy
wordcloud2(frequentTerms(trilogy$dialogue), size=0.4,
           figPath="../Star-Wars-Movie-Scripts/rebel alliance.png")


7.1 Sentiment analysis


Let’s address the topic of opinion mining, or sentiment analysis. We can use text-mining tools to approach the emotional content of the text programmatically.

# Transform the text to a tidy data structure with one token per row
tokens <- trilogy %>%  
  mutate(dialogue=as.character(dialogue)) %>%
  unnest_tokens(word, dialogue)

First we are going to use the general-purpose lexicon bing, from Bing Liu and collaborators, which categorizes words in a binary fashion into positive and negative categories.

# Positive and negative words
tokens %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort=TRUE) %>%
  acast(word ~ sentiment, value.var="n" , fill=0) %>%
  comparison.cloud(colors=c("#F8766D", "#00BFC4"), max.words=100)

The nrc lexicon (from Saif Mohammad and Peter Turney) categorizes words in a binary fashion into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.

# Sentiments and frequency associated with each word  
sentiments <- tokens %>% 
  inner_join(get_sentiments("nrc")) %>%
  count(word, sentiment, sort=TRUE) 

# Frequency of each sentiment
ggplot(data=sentiments, aes(x=reorder(sentiment, -n, sum), y=n)) + 
  geom_bar(stat="identity", aes(fill=sentiment), show.legend=FALSE) +
  labs(x="Sentiment", y="Frequency")

We can use this lexicon to compute the most frequent words for each sentiment.

# Top 10 terms for each sentiment
sentiments %>%
  group_by(sentiment) %>%
  arrange(desc(n)) %>%
  slice(1:10) %>%
  ggplot(aes(x=reorder(word, n), y=n)) +
  geom_col(aes(fill=sentiment), show.legend=FALSE) +
  facet_wrap(~sentiment, scales="free_y") +
  labs(y="Frequency", x="Terms") +
  coord_flip() 


7.2 Analysis by character


In the following visualizations we only consider the top 10 characters with the most dialogues.

# Sentiment analysis for the top 10 characters with the most dialogues
tokens %>%
  filter(character %in% c("LUKE", "HAN", "THREEPIO", "LEIA", "VADER",
                          "BEN", "LANDO", "YODA", "EMPEROR", "RED LEADER")) %>%
  inner_join(get_sentiments("nrc")) %>%
  count(character, sentiment, sort=TRUE) %>%
  ggplot(aes(x=sentiment, y=n)) +
  geom_col(aes(fill=sentiment), show.legend=FALSE) +
  facet_wrap(~character, scales="free_x") +
  labs(x="Sentiment", y="Frequency") +
  coord_flip()

To calculate the most frequent words for each character, we are going to use an approach different from the term-document matrix: the tidy way.

# Stopwords
mystopwords <- tibble(word=c(stopwords("english"), 
                             c("thats","weve","hes","theres","ive",
                               "will","can","cant","dont","youve",
                               "youre","youll","theyre","whats","didnt","us")))

# Tokens without stopwords
tokens.top.chars <- trilogy %>%
  mutate(dialogue=as.character(dialogue)) %>%
  filter(character %in% c("LUKE", "HAN", "THREEPIO", "LEIA", "VADER",
                          "BEN", "LANDO", "YODA", "EMPEROR", "RED LEADER")) %>%
  unnest_tokens(word, dialogue) %>%
  anti_join(mystopwords, by="word")

# Most frequent words for each character
tokens.top.chars %>%
  count(character, word) %>%
  group_by(character) %>% 
  arrange(desc(n)) %>%
  slice(1:10) %>%
  ungroup() %>%
  mutate(word2=factor(paste(word, character, sep="__"), 
                      levels=rev(paste(word, character, sep="__")))) %>%
  ggplot(aes(x=word2, y=n)) +
  geom_col(aes(fill=character), show.legend=FALSE) +
  facet_wrap(~character, scales="free_y") +
  labs(x="Terms", y="Frequency") +
  scale_x_discrete(labels=function(x) gsub("__.+$", "", x)) +
  coord_flip()  

What is the problem with this visualization? Some of these words are generic and not very meaningful. We can use the bind_tf_idf() function to obtain more relevant and characteristic terms for each character. The idea of tf–idf (term frequency–inverse document frequency) is to find the words that are important for the content of each document by decreasing the weight of commonly used words and increasing the weight of words that are rarely used across the collection, or corpus, of documents. If a term appears in every document, it is not likely to be insightful.

# Most relevant words for each character
tokens.top.chars %>%
  count(character, word) %>%
  bind_tf_idf(word, character, n) %>%
  group_by(character) %>% 
  arrange(desc(tf_idf)) %>%
  slice(1:10) %>%
  ggplot(aes(x=reorder(word, tf_idf), y=tf_idf)) +
  geom_col(aes(fill=character), show.legend=FALSE) +
  facet_wrap(~character, scales="free_y") +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  labs(y="tf–idf", x="Terms") +
  coord_flip()
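To see why common terms get downweighted, the tf–idf computation can be checked by hand on a tiny made-up table (the words and counts here are hypothetical, not taken from the scripts):

```r
library(dplyr)
library(tidytext)

# Two characters treated as "documents", with hypothetical word counts
toy <- tibble(character=c("HAN", "HAN", "YODA"),
              word=c("falcon", "force", "force"),
              n=c(2, 1, 3))
toy.tfidf <- toy %>% bind_tf_idf(word, character, n)
# "force" appears in both documents, so idf = ln(2/2) = 0 and its tf-idf is 0;
# "falcon" is unique to HAN, so idf = ln(2/1) and tf-idf = (2/3)*ln(2) ≈ 0.46
```

This is exactly why words shared by every character vanish from the tf–idf rankings, leaving each character's distinctive vocabulary.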

These words are, as measured by tf–idf, the most characteristic of each character. Indeed, we can identify most of the characters just by looking at their words.


8 Summary


In this entry we have performed a statistical text analysis of the Star Wars scripts from the Original Trilogy (Episodes IV, V and VI), including:

- Dialogue and character counts, and the characters with the most dialogues
- Wordclouds of the most frequent terms
- Sentiment analysis with the bing and nrc lexicons, overall and by character
- The most relevant words for each character, measured by tf–idf

It has been a pleasure to write this post, and I have learned a lot! Thank you for reading, and if you liked it, please upvote.

NOTE: I had a lot of problems with the rendering of the wordclouds in Kaggle. To solve this, I exported the images from RStudio, published them on Imgur and used those URLs in this kernel.


9 References


Hadley Wickham (2017). tidyverse: Easily Install and Load the ‘Tidyverse’. R package version 1.2.1. https://CRAN.R-project.org/package=tidyverse

Ingo Feinerer and Kurt Hornik (2017). tm: Text Mining Package. R package version 0.7-3. https://CRAN.R-project.org/package=tm

Ian Fellows (2014). wordcloud: Word Clouds. R package version 2.5. https://CRAN.R-project.org/package=wordcloud

Dawei Lang and Guan-tin Chien (2018). wordcloud2: Create Word Cloud by ‘htmlwidget’. R package version 0.2.1. https://CRAN.R-project.org/package=wordcloud2

Julia Silge and David Robinson (2016). tidytext: Text Mining and Analysis Using Tidy Data Principles in R. JOSS, 1(3). doi:10.21105/joss.00037. https://doi.org/10.21105/joss.00037

Hadley Wickham (2007). Reshaping Data with the reshape Package. Journal of Statistical Software, 21(12), 1-20. URL http://www.jstatsoft.org/v21/i12/.